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Abstract 

We introduce a novel training principle for generative probabilistic models that is an al¬ 
ternative to maximum likelihood. The proposed Generative Stochastic Networks (GSN) 
framework generalizes Denoising Auto-Encoders (DAE) and is based on learning the tran¬ 
sition operator of a Markov chain whose stationary distribution estimates the data distri¬ 
bution. The transition distribution is a conditional distribution that generally involves a 
small move, so it has fewer dominant modes and is unimodal in the limit of small moves. 
This simplifies the learning problem, making it less like density estimation and more akin 
to supervised function approximation, with gradients that can be obtained by backprop. 
The theorems provided here provide a probabilistic interpretation for denoising autoen¬ 
coders and generalize them; seen in the context of this framework, auto-encoders that 
learn with injected noise are a special case of GSNs and can be interpreted as generative 
models. The theorems also provide an interesting justification for dependency networks 
and generalized pseudolikelihood and define an appropriate joint distribution and sampling 
mechanism, even when the conditionals are not consistent. GSNs can be used with missing 
inputs and can be used to sample subsets of variables given the rest. Experiments validat¬ 
ing these theoretical results are conducted on both synthetic datasets and image datasets. 
The experiments employ a particular architecture that mimics the Deep Boltzmann Ma¬ 
chine Gibbs sampler but that allows training to proceed with backprop through a recurrent 
neural network with noise injected inside and without the need for layerwise pretraining. 
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1. Introduction 


Research in deep learning (see Bengio (2009) and Bengio et al. (2013a) for reviews) grew 
from breakthroughs in unsupervised learning of representations, based mostly on the Re¬ 


stricted Boltzmann Machine (RBM) (Hinton et ah, 2006), auto-encoder variants (Bengio 


The first two authors had an equally important contribution 
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Figure 1: Top: A denoising auto-encoder defines an estimated Markov chain where the tran¬ 
sition operator first samples a corrupted X from C{X\X) and then samples a reconstruction 
from P 0 {X\X), which is trained to estimate the ground truth P{X\X). Note how for any 
given X, P{X\X) is a much simpler (roughly unimodal) distribution than the ground truth 
P{X) and its partition function is thus easier to approximate. Bottom: More generally, a 
GSN allows the use of arbitrary latent variables H in addition to X, with the Markov chain 
state (and mixing) involving both X and H. Here H is the angle about the origin. The 
GSN inherits the benefit of a simpler conditional and adds latent variables, which allow 


more powerful deep representations in which mixing is easier (Bengio et ah, 2013b). 






















Vincent et al., 2008), and sparse coding variants (Lee et al., 2007 Ranzato 


20071). However, the most impressive recent results have been obtained with purely 


supervised learning techniques for deep networks, in particular for speech recognition (]Dahl 


et al. 

2010 

Deng et al. 

Mol 

Seide et al. 

2011 

) and object recognition 

(Krizhevsky et al. 

2012) 

. The latest breakthrough in object recognition ( 

Krizhevsky et al. 

2012 

) was achieved 


with fairly deep convolutional networks with a form of noise injection in the input and 


hidden layers during training, called dropout (Hinton et ah, 2012) 


In all of these cases, the availability of large quantities of labeled data was critical. 

On the other hand, progress with deep unsupervised architectures has been slower, with 
the established approaches with a probabilistic footing being the Deep Belief Network 
(DBN) ([Hinton et al. 2006| ) and the Deep Boltzmann Machine (DBM) (Salakhutdinov 


and Hinton| 2009). Although single-layer unsupervised learners are fairly well developed 


and used to pre-train these deep models, jointly training all the layers with respect to a 
single unsupervised criterion remains a challenge, with a few techniques arising to reduce 


that difficulty (Montavon and Muller, 2012; Goodfellow et ah, 2013). In contrast to recent 


progress toward joint supervised training of models with many layers, joint unsupervised 
training of deep models remains a difficult task. 


In particular, the normalization constant involved in complex multimodal probabilistic 
models is often intractable and this is dealt with using various approximations (discussed 
below) whose limitations may be an important part of the difficulty for training and using 
deep unsupervised, semi-supervised or structured output models. 


Though the goal of training large unsupervised networks has turned out to be more elusive 
than its supervised counterpart, the vastly larger available volume of unlabeled data still 
beckons for efficient methods to model it. Recent progress in training supervised models 
raises the question: can we take advantage of this progress to improve our ability to train 
deep, generative, unsupervised, semi-supervised or structured output models? 


This paper lays theoretical foundations for a move in this direction through the following 
main contributions: 


1 — Intuition: In Section we discuss what we view as basic motivation for studying 
alternate ways of training unsupervised probabilistic models, i.e., avoiding the intractable 
sums or maximization involved in many approaches. 


2 — Training Framework: We start Section by presenting our recent work on the 
generative view of denoising auto-encoders (Section |3.1[ ). We present the walkback algorithm 
which addresses some of the training difficulties with denoising auto-encoders (Section |3.2[). 


We then generalize those results by introducing latent variables in the framework to define 
Generative Stochastic Networks (GSNs) (Section 3.4). GSNs aim to estimate the data- 
generating distribution indirectly, by parametrizing the transition operator of a Markov 
chain rather than directly parametrizing a model P{X) of the observed random variable X. 
Most critically, this framework transforms the unsupervised density estimation problem into 
one which is more similar to supervised function approximation. This enables training by 
(possibly regularized) maximum likelihood and gradient descent computed via simple back- 
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propagation, avoiding the need to compute intractable partition functions. Depending on 
the model, this may allow us to draw from any number of recently demonstrated supervised 
training tricks. For example, one could use a convolutional architecture with max-pooling 


for parametric parsimony and computational efficiency, or dropout (Hinton et ah, 2012) to 
prevent co-adaptation of hidden representations. 

3 — General theory: Training the generative (decoding / denoising) component of a 
GSN P{X\h) with noisy representation h is often far easier than modeling P{X) explicitly 
(compare the blue and red distributions in Figure [^. We prove that if our estimated 
P{X\h) is consistent (e.g. through maximum likelihood), then the stationary distribution 
of the resulting Markov chain is a consistent estimator of the data-generating density, P{X) 
(Section 3.1 and Appendix]^. 


4 — Consequences of theory: We show that the model is general and extends to a wide 
range of architectures, including sampling procedures whose computation can be unrolled 
as a Markov Chain, i.e., architectures that add noise during intermediate computation 
in order to produce random samples of a desired distribution (Theorem]^. An exciting 
frontier in machine learning is the problem of modeling so-called structured outputs, i.e., 
modeling a conditional distribution where the output is high-dimensional and has a complex 
multimodal joint distribution (given the input variable). We show how GSNs can be used 
to support such structured output and missing values (Section |3.6|). 


5 — Example application: In Section [4.2| we show an example application of the GSN 
theory to create a deep GSN whose computational graph resembles the one followed by 
Gibbs sampling in deep Boltzmann machines (with continuous latent variables), but that 
can be trained efficiently with back-propagated gradients and without layerwise pretraining. 
Because the Markov Chain is defined over a state {X, h) that includes latent variables, we 
reap the dual advantage of more powerful models for a given number of parameters and 
better mixing in the chain as we add noise to variables representing higher-level information, 
first suggested by the results obtained by |Bengio et al. (2013b) and Luo et al. (2013). The 
experimental results show that such a model with latent states indeed mixes better than 
shallower models without them (Table [^. 

6 — Dependency networks: Finally, an unexpected result falls out of the GSN theory: 
it allows us to provide a novel justification for dependency networks (Heckerman et al., 2000| 
and for the first time define a proper joint distribution between all the visible variables that 

learned by such models (Section |3.8[). 


IS 


2. Summing over too many major modes 

The approach presented in this paper is motivated by a difficulty often encountered with 
probabilistic models, especially those containing anonymous latent variables. They are 
called anonymous because no a priori semantics are assigned to them, like in Boltzmann 
machines, and unlike in many knowledge-based graphical models. Whereas inference over 
non-anonymous latent variables is required to make sense of the model, anonymous variables 
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are only a device to capture the structure of the distribution and need not have a clear 
human-readable meaning. 


However, graphical models with latent variables often require dealing with either or both 
of the following fundamentally difficult problems in the inner loop of training, or to actually 
use the model for making decisions: inference (estimating the posterior distribution over 
latent variables h given inputs x) and sampling (from the joint model of h and x). However, 
if the posterior P[h\x) has a huge number of modes that matter, then the approximations 
made may break down. 


Many of the computations involved in graphical models (inference, sampling, and learn¬ 
ing) are made intractable and difficult to approximate because of the large number of 
non-negligible modes in the modeled distribution (either directly P{x) or a joint distri¬ 
bution P(x,h) involving latent variables h). In all of these cases, what is intractable is 
the computation or approximation of a sum (often weighted by probabilities), such as a 
marginalization or the estimation of the gradient of the normalization constant. If only a 
few terms in this sum dominate (corresponding to the dominant modes of the distribution), 
then many good approximate methods can be found, such as Monte-Carlo Markov chains 
(MCMC) methods. 


Deep Boltzmann machines ( ]Salakhutdinov and Hinton 2009) combine the difficulty of 
inference (for the positive phase where one tries to push the energies associated with the 
observed x down) and also that of sampling (for the negative phase where one tries to push 
up the energies associated with x’s sampled from P{x)). Sampling for the negative phase 


is usually done by MCMC, although some unsupervised learning algorithms (Collobert 


and Weston, 2008; Gutmann and Hyvarinen, 2010; Bordes et ah, 2013) involve “negative 


examples” that are sampled through simpler procedures (like perturbations of the observed 
input, in a spirit reminiscent of the approach presented here). Unfortunately, using an 
MCMC method to sample from P{x, h) in order to estimate the gradient of the partition 
function may be seriously hurt by the presence of a large number of important modes, as 
argued below. 


To evade the problem of highly multimodal joint or posterior distributions, the currently 
known approaches to dealing with the above intractable sums make very strong explicit 
assumptions (in the parametrization) or implicit assumptions (by the choice of approxima¬ 
tion methods) on the form of the distribution of interest. In particular, MCMC methods 
are more likely to produce a good estimator if the number of non-negligible modes is small: 
otherwise the chains would require at least as many MCMC steps as the number of such 
important modes, times a factor that accounts for the mixing time between modes. Mixing 
time itself can be very problematic as a trained model becomes sharper, as it approaches 
a data-generating distribution that may have well-separated and sharp modes (i.e., mani¬ 
folds) (Bengio et ah, 2013b). 

We propose to make another assumption that might suffice to bypass this multimodality 
problem: the effectiveness of function approximation. As is typical in machine learning, we 
postulate a rather large and flexible family of functions (such as deep neural nets) and then 


(Bengio et ah, 2013b 


5 
























use all manner of tricks to pick a member from that combinatorially large family (i.e. to 
train the neural net) that both fits observed data and generalizes to unseen data well. 


In particular, the GSN approach presented in the next section relies on estimating the 
transition operator of a Markov chain, e.g. P{xt\xt-i) or P{xt, ht\xt-i, ht-i). Because each 
step of the Markov chain is generally local, these transition distributions will often include 
only a very small number of important modes (those in the neighborhood of the previous 
state). Hence the gradient of their partition function will be easy to approximate. For 
example consider the denoising transitions studied by Bengio et al. (2013c) and illustrated 
in Figure where xt-i is a stochastically corrupted version of xt-i and we learn the 
denoising distribution P{x\x). In the extreme case (studied empirically here) where P{x\x) 
is approximated by a unimodal distribution, the only form of training that is required 
involves function approximation (predicting the clean x from the corrupted x). 


Although having the true P{x\x) turn out to be unimodal makes it easier to find an 
appropriate family of models for it, unimodality is by no means required by the GSN 
framework itself. One may construct a GSN using any multimodal model for output (e.g. 
mixture of Gaussians, RBMs, NADE, etc.), provided that gradients for the parameters of 
the model in question can be estimated (e.g. log-likelihood gradients). 


The approach proposed here thus avoids the need for a poor approximation of the gradient 
of the partition function in the inner loop of training, but still has the potential of capturing 
very rich distributions by relying mostly on “function approximation”. 


Besides the approach discussed here, there may well be other very different ways of 
evading this problem of intractable marginalization, including approaches such as sum- 


product networks (Poon and Domingos, 2011), which are based on learning a probability 


function that has a tractable form by construction and yet is from a flexible enough family of 
distributions. Another interesting direction of investigation that avoids the need for MGMC 


and intractable partition functions is the variational auto-encoder ( 

Kingma and Welling 

2014 

Gregor et al. 

2014 

Mnih and Gregor 

2014 

Rezende et al. 

2014) and related directed 

models ( 

Bornschein and Bengio 

2014; 

Ozair and Bengio 

2014 

), which rely on learned 


approximate inference. 


3. Generative Stochastic Networks 

In this section we work our way from denoising auto-encoders (DAE) to generative stochas¬ 
tic networks (GSN). We illustrate the usefulness of denoising auto-encoders being applied 
iteratively as a way to generate samples (and model a distribution). We introduce the 
walkback training algorithm and show how it can facilitate the training. 

We generalize the theory to GSNs, and provide a theorem that serves as a recipe as to 
how they can be trained. We also reference a classic result from matrix perturbation theory 
to analyze the behavior of GSNs in terms of their stationary distribution. 

We then study how GSNs may be used to fill missing values and theoretical conditions 
for estimating associated conditional samples. Finally, we connect GSNs to dependency 
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nets and show how the GSN framework fixes one of the main problems with the theoretical 
analysis of dependency nets and propose a particular way of sampling from them. 


3.1 Denoising auto-encoders to model probability distributions 


Assume the problem we face is to construct a model for some unknown data-generating 
distribution P{X) given only examples of X drawn from that distribution. In many cases, 
the unknown distribution P{X) is complicated, and modeling it directly can be difficult. 

A recently proposed approach using denoising auto-encoders (DAE) transforms the dif¬ 
ficult task of modeling P{X) into a supervised learning problem that may be much easier 
to solve. The basic approach is as follows: given a clean example data point X from P{X), 
we obtain a corrupted version X by sampling from some corruption distribution C{X\X). 
For example, we might take a clean image, X, and add random white noise to produce X. 
We then use supervised learning methods to train a function to reconstruct, as accurately 
as possible, any X from the data set given only a noisy version X. As shown in Figure [ij 
the reconstruction distribution P{X\X) may often be much easier to learn than the data 
distribution P{X), because P{X\X) tends to be dominated by a single or few major modes 
(such as the roughly Gaussian shaped density in the figure). What we call a major mode 
is one that is surrounded by a substantial amount of probability mass. There may be a 
large number of minor modes that can be safely ignored in the context of approximating a 
distribution, but the major modes should not be missed. 


But how does learning the reconstruction distribution help us solve our original problem of 
modeling P[X)1 The two problems are clearly related, because if we knew everything about 
P(A), then our knowledge of the C{X\X) that we chose would allow us to precisely specify 
the optimal reconstruction function via Bayes rule: P{X\X) = ^C{X\X)P{X), where z is 
a normalizing constant that does not depend on X. As one might hope, the relation is also 
true in the opposite direction: once we pick a method of adding noise, C{X\X), knowledge 
of the corresponding reconstruction distribution P(X\X) is sufficient to recover the density 
of the data P{X). 


In a recent paper, Alain and Bengio (2013) showed that denoising auto-encoders with 
small Gaussian corruption and squared error loss estimated the score (derivative of the log- 
density with respect to the input) of continuous observed random variables, thus implicitly 
estimating P{X). The following Proposition generalizes this to arbitrary variables (dis¬ 
crete, continuous or both), arbitrary corruption (not necessarily asymptotically small), and 
arbitrary loss function (so long as they can be seen as a log-likelihood). 


Proposition 1 Let P{X) be the training distribution for which we only have empirical 
samples. Let C{X\X) be the fixed corruption distribution and Pg{X\X) be the trained re¬ 
construction distribution (assumed to have sufficient capacity). We define a Markov chain 
that starts at some Xq ~ P{X) and then iteratively samples pairs of values {Xk,Xk) by 
alternatively sampling from C{Xk\Xk) and from Pg{Xk+i\Xk)- 
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Let TT be the stationary distribution of this Markov chain 
sequence of values of {Xk}^^Q. 


when we consider only the 


If we assume that this Markov chain is irreducible, that its stationary distribution exists, 
and if we assume that is the distribution that minimizes optimally the following 

expected loss 

C = [ [ P{X)C{X\X)logPe{X\X)dXdX, 

Jx Jx 

then we have that the stationary distribution tt is the same as the training distribution P{X). 

Proof If we look at the density P{X) = f P(X)C(XlX)dX that we get for X by applying 
C(X\X) to the training data from P{X), we can rewrite the loss as a KL divergence 



X Jx 


P{X)C(X\X)logPe{X\X)dXdX = -KL (^P{X)C{X\X)\\Pe{X\X)P{X)'^ + cst 


where the constant is independent of P 0 {X\X). This expression is maximized when we have 
a P 0 {X\X) that satisfies 

P{X)C{X\X) = Pg{X\X)P{X). (1) 

In that case, we have that 


where P{X\X) represents the true conditional that we get through the usual application of 
Bayes’ rule. 

Now, when we sample iteratively between C{Xk\Xk) and Pg* {Xk+i\Xk) to get the Markov 
chain illustrated above, we are performing Gibbs sampling. We understand what Gibbs 
sampling does, and here we are sampling using the two possible ways of expressing the joint 
from equation 0. This means that the stationary distribution vr of the Markov chain will 
have P{X) as marginal density when we look only at the X^ component of the chain. 


Beyond proving that P{X\X) is sufficient to reconstruct the data density. Proposition!^ 
also demonstrates a method of sampling from a learned, parametrized model of the density, 
Pg{X), by running a Markov chain that alternately adds noise using C{X\X) and denoises 
by sampling from the learned Pg{X\X), which is trained to approximate the true P{X\X). 

Before moving on, we should pause to make an important point clear. Alert readers 
may have noticed that P{X\X) and P{X) can each be used to reconstruct the other given 
knowledge of C{X\X). Further, if we assume that we have chosen a simple C(X\X) (say, a 
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uniform Gaussian with a single width parameter), then P{X\X) and P{X) must both be 
of approximately the same complexity. Put another way, we can never hope to combine 
a simple C{X\X) and a simple P{X\X) to model a complex P{X). Nonetheless, it may 
still be the case that P{X\X) is easier to model than P{X) due to reduced computational 
complexity in computing or approximating the partition functions of the conditional dis¬ 
tribution mapping corrupted input X to the distribution of corresponding clean input X. 
Indeed, because that conditional is going to be mostly assigning probability to X locally 
around X, P{X\X) has only one or a few major modes, while P{X) can have a very large 
number of them. 

So where did the complexity go? P{X\X) has fewer major modes than P{X), but the 
location of these modes depends on the value of X. It is precisely this mapping from X —>■ 
mode location that allows us to trade a difficult density modeling problem for a supervised 
function approximation problem that admits application of many of the usual supervised 
learning tricks. 


In the Gaussian noise example, what happens is that the tails of the Gaussian are expo¬ 
nentially damping all but the modes that are near X, thus preserving the actual number 
of modes but considerably changing the number of major modes. In the Appendix we also 
present one alternative line of reasoning based on a corruption process C{X\X) that has 
finite local support, thus completely removing the modes that are not in the neighborhood 
of X. We argue that even with such a corruption process, the stationary distribution tt will 
match the original P{X), so long as one can still visit all the regions of interest through a 
sequence of such local jumps. 

Two potential issues with Proposition]^ are that 1) we are learning distribution P 9 {X\X) 
based on experimental samples so it is only asymptotically minimizing the desired loss, and 
2) we may not have enough capacity in our model to estimate Pg{X\X) perfectly. 

The issue is that, when running a Markov chain for infinitely long using a slightly imper¬ 
fect Pg{X\X), these small differences may affect the stationary distribution vr and compound 
over time. We are not allowed to “adjust” the Pg{X\X) as the chain runs. 

This is addressed by Theorem cited in the later Section |3.4[ That theorem gives us a 
result about continuity, so that, for “well-behaved” cases, when Pg{X\X) is close to P{X\X) 
we must have that the resulting stationary distribution vr is close to the original P{X). 


3.2 Walkback algorithm for training denoising anto-encoders 

In this section we describe the walkback algorithm which is very similar to the method 
from Proposition 1, but helps training to converge faster. It differs in the training samples 
that are used, and the fact that the solution is obtained through an iterative process. The 
parameter update changes the corruption function, which changes the X in the training 
samples, which influences the next parameter update, and so on. 
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Figure 2: Walkback samples get attracted by 
spurious modes and contribute to removing 
them. Segment of data manifold in violet and 
example walkback path in red dotted line, 
starting on the manifold and going towards 
a spurious attractor. The vector field repre¬ 
sents expected moves of the chain, for a uni- 
modal P{X\X), with arrows from X to X. 
The name walkback is because this proce¬ 
dure forces the model to learn to walk back 
from the random walk it generates, towards 
the X’s in the training set. 



Sampling in high-dimensional spaces (like in experiments in Section 4.1) using a simple 
local corruption process (such as Gaussian or salt-and-pepper noise) suggests that if the 
corruption is too local, the DAE’s behavior far from the training examples can create spu¬ 
rious modes in the regions insufficiently visited during training. More training iterations 
or increasing the amount of corruption noise helps to substantially alleviate that problem, 
but we discovered an even bigger boost by training the Markov chain to walk back towards 
the training examples (see Figure]^. We exploit knowledge of the currently learned model 
P 0 {X\X) to define the corruption, so as to pick values of X that would be obtained by 
following the generative chain: wherever the model would go if we sampled using the gener¬ 
ative Markov chain starting at a training example X, we consider to be a kind of “negative 
example” X from which the auto-encoder should move away (and towards A). The spirit 
of this procedure is thus very similar to the CD-A: (Contrastive Divergence with k MCMC 
steps) procedure proposed to train RBMs (Hinton, 1999[ Hinton et ah, 2006). 


We start by defining the modified corruption process Ck{X\X) that samples k times 
alternating between C{X\X) and the current Pg{X\X). 

We can express this recursively if we let Ci{X\X) be our original C{X\X), and then define 

Cfc+i(A|A) = f [ C{X\X')Pe{X'\X')Ck{X'\X)dX'dX' (2) 

Jx' Jx' 

Note that this corruption distribution Ck{X\X) now involves the distribution Pq{X\X) that 
we are learning. 

With the help of the above definition of Ck{X\X), we define the walkback corruption 
process Cwb(A|A). To sample from C^b, we first draw a k distributed according to some 
distribution, e.g., a geometric distribution with parameter p = 0.5 and support on k € 
{1,2,...}), and then we sample according to the corresponding Ck{X\X). Other values than 
p = 0.5 could be used, but we just want something convenient for that hyperparameter. 
Conceptually, the corruption process Cwb means that, from a starting point X we apply 
iteratively the original C and Pg, and then we flip a coin to determine if we want to do it 
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again. We re-apply until we lose the coin flip, and then this gives us a final value for the 
sample X based on X. 

The walkback loss is given by 

1 ^ 

(3) 

i=l 

for samples drawn from X ~ P{X), k ~ Geometric(0.5) and X ~ Cfc(X|X). 

Minimizing this loss is an iterative process because the samples used in the empirical ex¬ 
pression depend on the parameter 6 to be learned. This iterated minimization is what we 
call the walkback algorithm. Samples are generated with the current parameter value 9t, 
and then the parameters are modified to reduce the loss and yield Ot+i- We repeat until the 
process stabilizes. In practical applications, we do not have infinite-capacity models and we 
do not have a guarantee that the walkback algorithm should converge to some 9*. 


3.2.1 Reparametrization Trick 


Note that we do not need to analytically marginalize over the latent variables involved: we 
can back-propagate through the chain, considering it like a recurrent neural network with 
noise (the corruption) injected in it. This is an instance of the so-called reparametrization 
trick, already proposed in (Bengio, 2013 Kingma, 2013 Kingma and Welling 2014). The 


idea is that we can consider sampling from a random variable conditionally on others (such 
as X given X) as equivalent to applying a deterministic function taking as argument the 
conditioning variables as well as some i.i.d. noise sources. This view is particularly useful for 
the more general GSNs introduced later, in which we typically choose the latent variables 
to be continuous, i.e., allowing to backprop through their sampling steps when exploiting 
the reparametrization trick. 


3.2.2 Equivalence of the Walkback Procedure 

With the walkback algorithm, one can also decide to include or not in the loss function all 
the intermediate reconstruction distributions through which the trajectories pass. That is, 
starting from some Xq, we sample 

Xo~P(X) Xo~C(Xo|Xo), 

Xi~P,(Xi|Xo) Xi~C(Xi|Xi) 

X 2 ~ Pe{X 2 \Xi) X 2 ~ C(X 2 |X 2 ) 


Xk-i ~ Pe{Xk-i\Xk-2) Xk-i ~ CiXk-i\Xk-i) 

and we use all the pairs (X, X^) as training data for the walkback loss at equation 

The following proposition looks very similar to Proposition but it uses the walkback 
corruption instead of the original corruption C(X|X). It is also an iterated process through 
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which the current value of the parameter 9t sets the loss function that will be minimized 
by the updated 6 t+i- 

Proposition 2 Let P{X) he the training distribution for which we only have empirical 
samples. Let vr(X) be the implicitly defined asymptotic distribution of the Markov chain 
alternating sampling from P 0 {X\X) and C{X\X), where C is the original local corruption 
process. 

If we assume that P 0 {X\X) has sufficient capacity and that the walkback algorithm con¬ 
verges (in terms of being stable in the updates to P 0 {X\X)), then tt{x) = P{X). 

That is, the Markov chain defined by alternating P 0 {X\X) and C{X\X) gives us samples 
that are drawn from the same distribution as the training data. 

Proof 

Consider that during training, we produce a sequence of estimators P 0 fiX\X) where 
corresponds to the t-th training iteration (modifying the parameters after each iteration). 
With the walkback algorithm, is used to obtain the corrupted samples X from which 

the next model is produced. 

If training converges in terms of 6 t ^ 9*, it means that we have found a value of Pg* {X\X) 
such that 

1 ^ 

9* = argmin^ — ^logP 0 (wW|xW) 

i=l 

for samples {X^^\ drawn from X ~ P{X), X ~ Cwb(-^|-^)- 

By Proposition [l| we know that, regardless of the the corruption Cany(-^|-^) used, when 
we have a P 0 {X\X) that minimizes optimally the loss 





P{X)Cf,tiYiX\X)logP0{X\X)dXdX 


then we can recover P{X) by alternating between Cany(W|X) and P 0 {X\X). 

Therefore, once the model is trained with walkback, the stationary distribution tt of the 
Markov chain that it creates has the same distribution P{X) as the training data. 

Hence if we alternate between the original corruption C{X\X) and the walkback solution 
P 0 *{X\X), then the stationary distribution with respect to X is also P{X). ■ 


Note that this proposition applies regardless of the value of geometric distribution used 
to determine how many steps of corruption will be used. It applies whether we keep all the 
samples along to the way, or only the one at the last step. It applies regardless of if we use 
a geometric distribution to determine which Ck to select, or any other type of distribution. 

A consequence is that the walkback training algorithm estimates the same distribution 
as the original denoising algorithm, but may do it more efficiently (as we observe in the 
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experiments), by exploring the space of corruptions in a way that spends more time where 
it most helps the model to kill off spurious modes. 

The Markov chain that we get with walkback should also generally mix faster, be less 
susceptible to getting stuck in bad modes, but it will require a Pq* {X\X) with more capacity 
than originally. This is because P 0 *{X\X) is now less local, covering the values of the initial 
X that could have given rise to the X resulting from several steps of the Markov chain. 


3.3 Walkbacks with individual scaling factors to handle uncertainty 


The use of the proposed walkback training procedure is effective in suppressing the spurious 
modes in the learned data distribution. Although the convergence is guaranteed asymp¬ 
totically, in practice, given limited model capacity and training data, it has been observed 
that the more walkbacks in training, the more difficult it is to maximize Pq{X\X). This 
is simply because more and more noise is added in this procedure, resulting in X that is 
further away from A, therefore a potentially more complicated reconstruction distribution. 


In other words, Pq{X\X) needs to have the capacity to model increasingly complex re¬ 
construction distributions. As a result of training, a simple, or usually unimodal P 0 {X\X) 
is most likely to learn a distribution with a larger uncertainty than the one learned without 
walkbacks in order to distribute some probability mass to the more complicated and mul¬ 
timodal distributions implied by the walkback training procedure. One possible solution 
to this problem is to use a multimodal reconstruction distribution such as in jOzair et al. 
(2014), Larochelle and Murray (2011), or Dinh et al. (2015). We propose here another 


solution, which can be combined with the above, that consists in allowing a different level 
of entropy for different steps of the walkback. 


3.3.1 Scaling trick in binary X 

In the case of binary A, the most common choice of the reconstruction distribution is the 
factorized Multinoulli distribution where Pe{X\X) = 0 ^= 1 ^ dimen¬ 

sionality of A. Each factor P 0 (A*|A) is modeled by a Bernoulli distribution that has its 
parameter pi = sigmoid(/i(A)) where /*(•) is a general nonlinear transformation realized 
by a neural network. We propose to use a different scaling factor ak for different walkback 
steps, resulting in a new parameterization = sigmoid(afc/j(A)) for the k-th walkback 
step, with a/c > 0 being learned. Uk effectively scales the pre-activation of the sigmoid 
function according to the uncertainty or entropy associated with different walkback steps. 
Naturally, later reconstructions in the walkback sequence are less accurate because more 
noise has been injected. Hence, given the ki-th. and kj-th. walkback steps that satisfy ki < kj, 
the learning will tend to result in > akj because larger ak correspond to less entropy. 

3.3.2 Scaling trick in real-valued A 

In the case of real-valued A, the most common choice of Pe{X\X) is the factorized Gaussian. 
In particular, each factor P£)(A*| A) is modeled by a Normal distribution with its parameters 
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Hi and (Ti- Using the same idea of learning separate scaling factors, we can parametrize it 
as Pe(X*|X) = J\f{Hi,akcr‘f) for the k-th walkback step, is positive and also learned. 
However, Given the ki-th. and kj-th. walkback steps that satisfy ki < kj, the learning will 
result ak^ < a^j , since in this case, larger indicates larger entropy. 

3.3.3 Sampling with the learned scaling factors 

After learning the scaling factors ak for k different walkback steps, the sampling is straight¬ 
forward. One noticeable difference is that we have learned k Markov transition operators. 
Although, asymptotically all k Markov chains generate the same distribution of X, in prac¬ 
tice, they result in different distributions because of the different learned. In fact, using 
ai results the samples that are sharper and more faithful to the data distribution. We 
verify the effect of learning the scaling factor further in the experimental section. 


3.4 Extending the denoising auto-encoder to more general GSNs 

The denoising auto-encoder Markov chain is defined by Xt ~ C{X\Xt) and Xt+i ~ Pe(X| W), 
where Xt alone can serve as the state of the chain. The GSN framework generalizes the 
DAE in two ways: 

1. the “corruption” function is not fixed anymore but a parametrized function that can 
be learned and corresponds to a “hidden” state (so we write the output of this function 
H rather than X); and 

2. that intermediate variable H is now considered part of the state of the Markov chain, 
i.e., its value of Ht at step t of the chain depends not just on the previous visible Xt-i 
but also on the previous state Ht-i- 

For this purpose, we define the Markov chain associated with a GSN in terms of a visible 
Xt and a latent variable Ht as state variables, of the form 

Ht+i ~ Pe,{H\Ht,Xt) 

Xt+i ~ Pe,{X\Ht+i). 



This definition makes denoising auto-encoders a special case of GSNs. Note that, given that 
the distribution of Ht+i may depend on a previous value of Ht, we find ourselves with an 
extra Hq variable added at the beginning of the chain. This Hq complicates things when 
it comes to training, but when we are in a sampling regime we can simply wait a sufficient 
number of steps to burn in. 
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3.4.1 Main result about GSNs 


The next theoretical results give conditions for making the stationary distributions of the 
above Markov chain match a target data-generating distribution. It basically says that, 
in order to estimate the data-generating distribution P{Xq), it is enough to achieve two 
conditions. 


The first condition is similar to the one we obtain when minimizing denoising reconstruc¬ 
tion error, i.e., we must make sure that the reconstruction distribution P{Xi\Hi) approaches 
the conditional distribution P{Xq\Hi), i.e., the Xq’s that could have given rise to Hi. 

The second condition is novel and regards the initial state Hq of the chain, which in¬ 
fluences Hi. It says that P{Hq\Xq) must match P{Hi\Xq). One way to achieve that is 
to initialize Ho associated with a training example Xq with the previous value of Hi that 
was sampled when example Xq was processed. In the graphical model in the statement of 
Theorem]^ note how the arc relating Xq and Ho goes in the Xq —?■ Ho direction, which is 
different from the way we would sample from the GSN (graphical model above), where we 
have Ho —)• Xo- Indeed, during training, Xo is given, forcing it to have the data-generating 
distribution. 


Note that Theorem is there to provide us with a guarantee about what happens when 
those two conditions are satisfied. It is not originally meant to describe a training method. 


In section 3.4.3 we explain how to these conditions could be approximately achieved. 


Theorem 3 Let (Hf, Xt)^Q be the Markov chain defined by the following graphical model. 

• • 


If we assume that the chain has a stationary distribution tth.x, and that for every value of 
(x, h) we have that 

• all the P{Xt = x\Ht = h) = g{x\h) share the same density for t > 1 

• all the P{Ht+i = h\Ht = h',Xt = x) = f{h\h',x) shared the same density for t>0 

• P{Ho = h\Xo = x) = P{Hi = h\Xo = x) 

• P{Xi = x\Hi = h) = P{Xo = x\Hi = h) 
then for every value of {x, h) we get that 

• P{Xo = x\Ho = h) = g{x\h) holds, which is something that was assumed only for 
t > 1 
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• P{Xt = X, Ht = h) = P{Xq = x,Hq = h) for all t >0 

• the stationary distribution 'Kh,x has a marginal distribution ttx such that vr (x) = 
P{Xo = x). 

Those conclusions show that our Markov chain has the property that its samples in X are 
drawn from the same distribution as Xq. 

Proof The proof hinges on a few manipulations done with the first variables to show that 
P{Xt = x\Ht = h) = g{x\h), which is assumed for t > 1, also holds for t = 0. 

For all h we have that 

P{Hq = h) = j P{Ho = h\Xo = x)P{Xo = x)dx 

= J P{Hi = h\XQ = x)P{Xq = x)dx (by hypothesis) 

= P{Hi=h). 

The equality in distribution between {Xi^Hi) and {Xq^Hq) is obtained with 

P{Xi =x,Hi = h) = P{Xi = x\Hi = h)P{Hi = h) 

= P{Xq = x\P[i = h)P{Hi = h) (by hypothesis) 

= P{Xo = x,Hi = h) 

= P{Hi = h\Xo = x)P[Xo = x) 

= P{Hq = h\XQ = x)P{Xq = x) (by hypothesis) 

= P{Xo = x,Ho = h). 


Then we can use this to conclude that 

P{Xo = x,Ho = h) = P{Xi = x,Hi= h) 

P{Xo = x\Ho = h) = P{Xi = x\Hi = h) = g{x\h) 

so, despite the arrow in the graphical model being turned the other way, we have that the 
density of P{Xq = x\P[q = h) is the same as for all other P{Xt = x\Ht = h) with t > 1. 

Now, since the distribution of Hi is the same as the distribution of Hq, and the transition 
probability P[Hi = h\HQ = h') is entirely defined by the (/, 5 ) densities which are found 
at every step for all t > 0, then we know that {X 2 ,H 2 ) will have the same distribution as 
{Xi,Hi). To make this point more explicitly, 

PiHi = h\Ho = h') = j P{Hi = h\Ho = h', Xq = x)P{Xo = x\Ho = h')dx 

= j fih\h',x)g{x\h')dx 


= j P{H 2 = h\Hi = h', Xi = x)PiXi = x\Hi = h')dx 
= P{H2 = h\Hi = h') 
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This also holds for P{H 3 \H 2 ) and for all subsequent P(Ht+i\Ht). This relies on the crucial 
step where we demonstrate that P{Xq = x\Hq = h) = g(x\h). Once this was shown, then 
we know that we are using the same transitions expressed in terms of (/, g) at every step. 

Since the distribution of Hq was shown above to be the same as the distribution of Hi, 
this forms a recursive argument that shows that all the Ht are equal in distribution to Hq. 
Because g{x\h) describes every P{Xt = x\Ht = h), we have that all the joints {Xt,Ht) are 
equal in distribution to {Xq,Hii). 

This implies that the stationary distribution ttx,h is the same as that of {Xq,Hq). Their 
marginals with respect to X are thus the same. ■ 


Intuitively, the proof of Theoremachieves its objective by forcing all the {Ht,Xt) pairs 
to share the same joint distribution, thus making the marginal over Xt as t ^ oo (i.e. the 
stationary distribution of the chain vr) be the same as P{Xo), i.e., the data distribution. On 
the other hand, because it is a Markov chain, its stationary distribution does not depend 
on the initial conditions, making the model generate from an estimator of P{Xo) for any 
initial condition. 

To apply Theorem in a context where we use experimental data to learn a model, we 
would like to have certain guarantees concerning the robustness of the stationary density 
TTx- When a model lacks capacity, or when it has seen only a finite number of training 
examples, that model can be viewed as a perturbed version of the exact quantities found 
in the statement of Theorem [3l 


3.4.2 A NOTE ABOUT CONSISTENCY 


A good overview of results from perturbation theory discussing stationary distributions in 
finite state Markov chains can be found in (Cho et ah, 2000). We reference here only one 
of those results. 


Theorem 4 Adapted from (Schweitzer , 1968) 


Let K he the transition matrix of a finite state, irreducible, homogeneous Markov chain. 
Let TT be its stationary distribution vector so that Ktt = vr. Let A = I—K and Z = [A + C)~^ 
where C is the square matrix whose columns all contain tt. Then, if K is any transition 
matrix (that also satisfies the irreducible and homogeneous conditions) with stationary dis¬ 
tribution TT, we have that 


vr — vr 


li< 


K-K 


OO 


This theorem covers the case of discrete data by showing how the stationary distribution 
is not disturbed by a great amount when the transition probabilities that we learn are 
close to their correct values. We are talking here about the transition between steps of the 
chain (Aq, Hq), {Xi, Hi ),..., {Xt, Ht), which are defined in Theorem through the (/, g) 
densities. 


17 












3.4.3 Training criterion for GSNs 


So far we avoided discussing the training criterion for a GSN. Various alternatives exist, 
but this analysis is for future work. Right now Theorem suggests the following rules : 

• Define g{x\h) = P{Xi = x\Hi = h), i.e., the decoder, to be the estimator for 
P{Xq = x\Hi = h), e.g. by training an estimator of this conditional distribution 
from the samples {Xq,Hi), with reconstruction likelihood, logP(Vi = xo\Hi), as 
this would asymptotically achieve the condition P{Xq\Hi) = P{Xi\Hi). To see that 
this is true, consider the following. 

We sample Xq from P{Xo) (the data-generating distribution) and Hi from P{Hi\Ho, Xq). 
Refer to one of the next bullet points for an explanation about how to get values for 
Hq to be used when sampling from P{Hi\H(), Xq) here. This creates a joint distri¬ 
bution over (Xo,Hi) that has P{Xo\Hi) as a derived conditional. Then we train the 
parameters of a model Pg{Xi\Hi) to maximize the log-likelihood 


E, 


XO- 


■^p{XQ},hi''-P{Hi\xo)[^^S Pe{Xi — xo\hi)] 


I 

J XQ^hl 


P{xq, hi) \ogP 0 {Xi = xo\Hi = hi)dxodhi 


= f P{hi) f P{Xo = xo\Hi = hi)logP 0 {Xi = xo\hi)dxodhi 

J hi j XQ 

= -EHA^L{P{Xo\Hi)\\P0{Xi\Hi))] + const. 


(4) 


where the constant does not depend on 9, and thus the log-likelihood is maximized 
when 

P 0 {Xi = x\Hi = h) = P{Xo = x\Hi = h). 


• Pick the transition distribution f{h\h',x) to be useful, i.e., training it towards the 
same objective, i.e., sampling an h' that makes it easy to reconstruct x. One can 
think of f{h\h',x) as the encoder, except that it has a state which depends on its 
previous value in the chain. 

• To approach the condition P{Hq\Xq) = P{Hi\Xo), one interesting possibility is the 
following. For each Xq in the training set, iteratively sample Hi\(Ho, Xq) and substi¬ 
tute the value of Hi as the updated value of Hq. Repeat until you have achieved a 
kind of “burn in”. Note that, after the training is completed, when we use the chain 
for sampling, the samples that we get from its stationary distribution do not depend 
on Hq. Another option is to store the value of Hi that was sampled for the particular 
training example xq, and re-use it as the initial Hq the next time that xq is presented 
during training. These techniques of substituting Hi into Hq are only required during 
training. In our experiments, we actually found that a fixed Hq = 0 worked as well, 
so we have used this simpler approach in the reported experiments. 

• The rest of the chain for f > 1 is defined in terms of {f,g). 
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3.5 Random variable as deterministic function of noise 


There several equivalent ways of expressing a GSN. One of the interesting formulations is 
to use deterministic functions of random variables to express the densities (/, g) used in 
Theorem]^ With that approach, we define Ht+i = Zt, Ht) for some independent 

noise source Zt-, and we insist that Xt cannot be recovered exactly from Rt+i, to avoid a 
situation in which the Markov chain would not be ergodic. The advantage of that formu¬ 
lation is that one can directly back-propagate the reconstruction log-likelihood logP(Xi = 
xo\Hi = f{Xo,Zo,HQ)) into all the parameters of / and g, using the reparametrization 


trick discussed above in Section 3.2.1 This method is described in (Williams, 1992). 


In the setting described at the beginning of Section]^ the function playing the role of the 
“encoder” was fixed for the purpose of the theorem, and we showed that learning only the 
“decoder” part (but a sufficiently expressive one) sufficed. In this setting we are learning 
both, which can cause certain broken behavior. 


One problem would be if the created Markov chain failed to converge to a stationary 
distribution. Another such problem could be that the function 4>{Xt, Zt, Ht) learned would 
try to ignore the noise Zt, or not make the best use out of it. In that case, the reconstruction 
distribution would simply converge to a Dirac at the input X. This is the analogue of the 
constraint on auto-encoders that is needed to prevent them from learning the identity 
function. Here, we must design the family from which / and g are learned such that when 
the noise Z is injected, there are always several possible values of X that could have been 
the correct original input. 

Another extreme case to think about is when 4>{X, Z,H) is overwhelmed by the noise 
and has lost all information about X. In that case the theorems are still applicable while 
giving uninteresting results: the learner must capture the full distribution of X in {X\H) 
because the latter is now equivalent to Pq^{X), since (/>(X, Z, H) no longer contains informa¬ 
tion about X. This illustrates that when the noise is large, the reconstruction distribution 
(parametrized by 62 ) will need to have the expressive power to represent multiple modes. 
Otherwise, the reconstruction will tend to capture an average output, which would visually 
look like a fuzzy combination of actual modes. In the experiments performed here, we have 
only considered unimodal reconstruction distributions (with factorized outputs), because we 
expect that even if P{X\H) is not unimodal, it would be dominated by a single mode when 
the noise level is small. However, future work should investigate multimodal alternatives. 


A related element to keep in mind is that one should pick the family of conditional 
distributions Pe^iXlH) so that one can sample from them and one can easily train them 
when given {X,H) pairs, e.g., by maximum likelihood. 


3.6 Handling missing inpnts or structured output 

In general, a simple way to deal with missing inputs is to clamp the observed inputs and 
then run the Markov chain with the constraint that the observed inputs are fixed and 
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not resampled at each time step, whereas the unobserved inputs are resampled each time, 
conditioned on the clamped inputs. 

In the context of the GSN described in Section [3.4| using the two distributions 

Ht+i ~ Pe,{H\Ht,Xt) 

Xt+i ~ Pe,{X\Ht+i) 

we need to make some adjustments to to be able to sample X conditioned 

on some of its components being clamped. We also focus on the case where there are no 
connections between the Ht —?• Ht+i. That is, we study the more basic situation where we 
train an denoising auto-encoder instead of a GSN that has connections between the hidden 
units. 

Let 5 be a set of values that X can take. For example, S can be a subset of the units 
of X that are hxed to given values. We can talk about clamping X € S, or just “clamping 
5” when the meaning is clear. In order to sample from a distribution with clamped S, we 
need to be able to sample from 

Ht+i ~ P9,{H\Xt) 

Xt+i ~ Pe,iX\Ht+i,X € S). 

This notation might be strange at first, but it’s as legitimate as conditioning on 0 < X when 
sampling from any general distribution. It involves only a renormalization of the resulting 
distribution PQ^{X\P[t^i, X £ S). 

In a general scenario with two conditional distributions (P 6 ij,P 02 ) the roles of 

f{x\h) and g{h\x), i.e. the encoder and decoder, we can make certain basic assumptions 
so that the asymptotic distributions of {Xt,Ht) and (Xt,Ht+i) both exist. There is no 
reason to think that those two distributions are the same, and it is trivial to construct 
counter-examples where they differ greatly. 

However, when we train a DAE with infinite capacity, Proposition shows that the opti¬ 
mal solution leads to those two joints being the same. That is, the two trained conditional 
distributions f{h\x) and g{x\h) are mutually compatible. They form a single joint distribu¬ 
tion over (X, H). We can sample from it by the usual Gibbs sampling procedure. Moreover, 
the marginal distribution over X that we obtain will match that of the training data. This 
is the motivation for Proposition!^ 

Knowing that Gibbs sampling produces the desired joint distribution over {X,H), we 
can now see how it would be possible to sample from {X,H)\{X G S) if we are able to 
sample from f{h\x) and g{x\h,x G S). Note that it might be very hard to sample from 
g{x\h.,x G 5), depending on the particular model used. We are not making any assumption 
on the factorization of g{x\h), much like we are not making any assumption on the particular 
representation (or implementation) of g{x\h). 


In section 3.4.2| we address a valid concern about the possibility that, in a practical 
setting, we might not train g{x\h) to achieve an exact match the density of X\H. That 
g{x\h) may be very close to the optimum, but it might not be able to achieve it due to 
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its finite capacity or its particular parametrization. What does that imply about whether 
the asymptotic distribution of the Markov chain obtained experimentally compared to the 
exact joint {X, H) ? 

We deal with this issue in the same way as we dealt with it when it arose in the context 
of Theorem The best that we can do is to refer to Theorem and rely on an argument 
made in the context of discrete states that would closely approximate our situation (which 
is in either discrete or continuous space). 

Our Markov chain is homogeneous because it does not change with time. It can be made 
irreducible my imposing very light constraints on f{h\x) so that f{h\x) > 0 for all {x,h). 
This happens automatically when we take f{h\x) to be additive Gaussian noise (with fixed 
parameters) and we train only g{x\h). In that case, the optimum g{x\h) will assign non-zero 
probability weight on all the values of x. 

We cannot guarantee that a non-optimal g{x\h) will not be broken in some way, but we 
can often get g{x\h) to be non-zero by selecting a parametrized model that cannot assign 
a probability of exactly zero to an x. Finally, to use Theorem]^ we need to have that the 
constant from that Theorem to be non-zero. This is a bit more complicated to 

enforce, but it is something that we will get if the transition matrix stays away from the 
identity matrix. That constant is zero when the chain is close to being degenerate. 

Theorem says that, with those conditions verified, we have that an arbitrarily good 
g{x\h) will lead to an arbitrarily good approximation of the exact joint {X, H). 

Now that we know that this approach is grounded in sound theory, it is certainly reason¬ 
able to try it in experimental settings in which we are not satisfying all the requirements, 
and see if the results are useful or not. We would refer the reader to our experiment shown 
in Figurewhere we clamp certain units and resample the rest. 

To further understand the conditions for obtaining the appropriate conditional distribu¬ 
tions on some of the visible inputs when others are clamped, we consider below sufficient and 
necessary conditions for making the stationary distribution of the clamped chain correspond 
to the normalized distribution (over the allowed values) of the undamped chain. 

Proposition 5 Let f{h\x) and g{x\h) be the encoder and decoder functions such that they 
are mutually compatible (i.e. they represent a single joint distribution for which we can 
sample using Gibbs sampling). Let tt{X, H) denote that joint. 

Note that this happens when we minimize 


Ex 



g{x\h)f{h\x)dh 


or when we minimize the walkback loss (see Proposition^. 

Let S ^ X be a set of values that X can take (e.g. some components of X can be assigned 
certain fixed values), and such that E(X G S) > 0. Let 7r(x|x G S) denote the conditional 
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distribution of tt{X,H) on which we marginalize over H and eondition on X £ S. That is, 


7r(x|x £ S) (X 7r(x)I(x G S) 


7r(x, h)I(x £ S)dh 
fx fh h)I{x £ S)dhdx 


Let g{x\h, x £ S) denote a restriction of the decoder function that puts probability 
only on the values of x £ S. That is, 


weight 


g{x\h,x £ S) (X g{x\h)l{x £ S). 


If we start from some xq £ S and we run a Markov ehain by alternating between f{h\x) 
and g{x\h,x £ S), then the asymptotic distribution of that chain with respect to X will be 
the same as 7r(x|x G S). 


3.7 General conditions for clamping inputs 

In the previous section we gave a suffieient condition for “clamping 5” to work in the 
context of a Markov chain based on an encoder distribution with density f{h\x) and a 
decoder distribution with density g{x\h). 

In this section, we will give a sufficient and necessary condition on the sufficient and 
necessary conditions for handling missing inputs by clamping observed inputs. 

Proposition 6 Assume we have an ergodic Markov chain with transition operators having 
density f{h\x) and g{x\h). Its unique stationary distribution is 7r{x,h) over X xH which 
satisfies: 

( 7r(x, h)f{h'\x)g{x\h')dxdh = 7r(x', h'). 

Jxxn 

Assume that we start from {Xq,Hq) = (xo,/io) where xq £ S, S f X (S can be con¬ 
sidered as a constraint over X) and we sample {Xt+i, Ht+i) by first sampling Ht+i with 
encoder f{Ht+i\Xt) and then sampling Xt+i with decoder g{Xt+i\Ht+i, Xt+i £ S), the new 
stationary distribution we reach is 7rs{x,h). 

Then a sufficient condition for 


= 7r(x|x G S) 


is for 7r(x|x G 5) to satisfy 


7r(x|x G S)f{h'\x)dx = Tr{h'\x £ S) 


where 7r(x|x G S) and 7r(/i'|x G S) are conditional distributions 

/ , ...N 7r(x) f^Tr(x,h')dx 

7 r{x\x£S) = -— , , 7r{h\x£S) = - - - — 

J^Tr{x)dx J^^y^TT[x, hjdxah 


( 5 ) 
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Proof Based on the assumption that the chain is ergodic, we have that tts{X,H) is the 
unique distribution satisfying 

f 7rs{x,h)f{h'\x)g{x'\h',x'^S)dxdh = 7Ts{x',h'). (6) 

JsxH 

Now let us check if tt{x, h\x G S) satisfies the equation above. 

The Markov chain described in the statement of the Theorem is defined by looking at 
the slices {Xt, Ht). This means that, by construction, the conditional density 7r(x|/i) is just 
given hy g{x\h). 

This relation still holds even if we put the S constraint on x 

g{x'\h', x' G 5) = 7r(x'|/i', x' G 5). 

Now if we substitute 7 rs{x,h) by 7r(x,/i|x G S) in Equation]^ the left side of Equation]^ 
becomes 


/ 7r(x, h\x G S)f{h'\x)7r{x'\h', x' G S)dxdh 

JSxH 

= 7r(x'|/i', x'G 5) f (/ Tr{x, h\x € S)dh)f{h'\x)dx 
Js Jn 

= 7r{x'\h', x'€ S) j 7r(x|x G 5)/(/i'|x)dx 

= 'K{x'\h\x' G S)Tr{h'\x G S) (using Equation]^ 

= 7r(x'|/i', x'G 5)7r(/i'|x'G 5) 

= 7r(x', h'\x' G 5). 

This shows that 7r(x, h\x G S) satisfies Equation]^ Due to the ergodicity of the chain, the 
distribution tts{x, h) that satisfies Equationj^is unique, so we have •ns{x, h) = 7r(x, h\x G S). 
By marginalizing over h we get 


■^ 5 ( 3 ^) = 7r(x|x G S). 


Proposition gives a sufficient condition for dealing missing inputs by clamping observed 
inputs. Note that this condition is weaker than the mutually compatible condition discussed 
in Section 3.6 Furthermore, under certain circumstances, this sufficient condition becomes 


necessary, and we have the following proposition : 


Proposition 7 Assume that the Markov chain in Proposition has finite discrete state 
space for both X and H. The condition in Equation^ in Proposition^ becomes a necessary 
condition when all discrete conditional distributions g{x\h,x G S) are linear independent. 

Proof We follow the same notions in Proposition and now we have Trs{x) = 7r(x|x G 
S). Because '7r5(x) is the marginal of the stationary distribution reached by alternatively 
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sampling with encoder f{H\X) and decoder g{X\H,X £ S), we have that 'k{x\x £ S) 
satisfies 


•k{x\x £ S)( / f(h'lx)7r(x'lh', x' £ S)dh')dx = 'k{x'\x £ S) 

Jn 


which is a direct conclusion from Equation]^ when considering the fact that 7rs(x) = 'k{x\x £ 
S) and g{x'\h',x' £ S) = 7r(x'|/i',x' G S). If we re-arrange the integral of equation above, 
we get: 

f 7 r{x'\h',x'£ S){ f 7 r{x\x £ S)f{h'\x)dx)dh'= 7 r{x'\x'£ S). (7) 

Jn Js 

Note that j'^7r(x|x £ S)f{h'\x)dx is the same as the left side of Equationj^in Proposition 
and it can be seen as some function F{h') satisfying ^^F{h')dh' = 1. Because we 
have considered a GSN over a finite discrete state space X = {xi,-- - ,xn} and 77 = 
{hi, ■ ■ ■ , Hm}, the integral in Equationbecomes the linear matrix equation 


G 


F = P 


XJ 


where G{i,j) = g{x'j}h'j,x' £ S) = 7r(x'|/i'-,x' G S), F{i) = F{h[) and Px(f) = 7r(x'|x' G S). 
In other word, F is a solution of the linear matrix equation 


G Z = P,,. 


From the definition of G and P^,, it is obvious that P/i is also a solution of this linear 
matrix equation, if P/i(i) = 7r(/i'|x' G S). Because all discrete conditional distributions 
g{x\h,x £ S) are linear independent, which means that all the column vectors of G are 
linear independent, then this linear matrix equation has no more than one solution. Since 
P/j is the solution, we have F = P/^, equivalently in integral form 

F{h') = f 7r(x|x G S)f{h'\x)dx = 7r(/i'|x G S) 

Js 

which is the condition Equationin Proposition!^ 


Propositionsays that at least in discrete finite state space, if the g{x\h,x £ S) satisfies 
some reasonable condition like linear independence, then along with Proposition the 
condition in Equation is the necessary and sufficient condition for handling missing inputs 
by clamping the observed part for at least one subset S. If we want this result to hold for 
any subset S, we have the following proposition: 

Proposition 8 If the condition in Equation in Proposition holds for any subset of S 
that S <£ X, then we have 

f{h'\x) = 7r(fi'|x) 

In other words, f{h\x) and g{x\h) are two conditional distributions marginalized from a 
single joint distribution 7r(x, h). 

Proof Because S can be any subset oi X, of course that S can be a set which only has 
one element xq, i.e., S = {xq}. Now the condition in Equation]^ in Propositionbecomes 

1 • f{h'\x = Xq) = TT{h'\x = Xq). 
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Because xq can be an arbitrary element in X, we have 

f{h'\x) = 7r(/i'|x), or f{h\x) = 'K{h\x). 

Since from Proposition we already know that g{x\h) is 7r(x|/i), we have that f{h\x) and 
g{x\h) are mutually compatible, that is, they are two conditional distributions obtained by 
normalization from a single joint distribution Tr{x,h). ■ 

According to Proposition]^ if condition in Equationholds for any subset S, then f{h\x) 
and g{x\h) must be mutually compatible to the single joint distribution 7 r{x,h). 


3.8 Dependency Networks as GSNs 


Dependency networks (Heckerman et al. 2000) are models in which one estimates condi¬ 
tionals Pi{xi\x-i), where denotes x \ Xi, i.e., the set of variables other than the i-th 
one, Xi- Note that each Pi may be parametrized separately, thus not guaranteeing that 
there exists a joint of which they are the conditionals. Instead of the ordered pseudo- 
Gibbs sampler dehned in Heckerman et al. (2000), which resamples each variable Xi in 


the order xi,X 2 ,..., we can view dependency networks in the GSN framework by dehn- 
ing a proper Markov chain in which at each step one randomly chooses which variable to 
resample. The corruption process therefore just consists oi H = f{X, Z) = X-g where 
X-s is the complement of Xg, with s a randomly chosen subset of elements of X (possibly 
constrained to be of size 1). Furthermore, we parametrize the reconstruction distribu¬ 
tion as P 02 {X = x\H) = 5x_^=X-s^d2,s{Xg = Xs|x_s) where the estimated conditionals 
P 0 ^^g{Xg = are not constrained to be consistent conditionals of some joint distribu¬ 

tion over all of X. 


Proposition 9 If the above GSN Markov ehain has a stationary distribution, then the 
dependeney network defines a joint distribution (which is that stationary distribution), which 
does not have to be known in closed form. Furthermore, if the conditionals P{Xg\X-s) are 
eonsistent estimators of the ground truth conditionals, then that stationary distribution is a 
eonsistent estimator of the ground truth joint distribution. 

The proposition can be proven by immediate application of Proposition with the above 
particular GSN model definitions. 

This joint stationary distribution can exist even if the conditionals are not consistent. 
To show that, assume that some choice of (possibly inconsistent) conditionals gives rise to 
a stationary distribution vr. Now let us consider the set of all conditionals (not necessarily 
consistent) that could have given rise to that tt. Glearly, the conditionals derived from tt 
by Bayes rule are part of that set, but there are infinitely many others (a simple counting 
argument shows that the hxed point equation of vr introduces fewer constraints than the 
number of degrees of freedom that define the conditionals). To better understand why the 
ordered pseudo-Gibbs chain does not beneht from the same properties, we can consider an 
extended case by adding an extra component of the state X, being the index of the next 
variable to resample. In that case, the Markov chain associated with the ordered pseudo- 
Gibbs procedure would be periodic, thus violating the ergodicity assumption of the theorem. 
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However, by introducing randomness in the choice of which variable(s) to resample next, 
we obtain aperiodicity and ergodicity, yielding as stationary distribution a mixture over all 
possible resampling orders. These results also show in a novel way (see e.g. Hyvarinen 


(2006) for earlier results) that training by pseudolikelihood or generalized pseudolikelihood 
provides a consistent estimator of the associated joint, so long as the GSN Markov chain 
defined above is ergodic. This result can be applied to show that the multi-prediction 


deep Boltzmann machine (MP-DBM) training procedure introduced by Goodfellow et al. 


(2013) also corresponds to a GSN. This has been exploited in order to obtain much better 
samples using the associated GSN Markov chain than by sampling from the corresponding 


DBM (Goodfellow et al. 2013). Another interesting conclusion that one can draw from that 


paper and its GSN interpretation is that state-of-the-art classification error can thereby be 
obtained: 0.91% on MNIST without fine-tuning (best comparable previous DBM results was 
well above 1%) and 10.6% on permutation-invariant NORB (best previous DBM results was 
10 . 8 %). 


4. Experimental results 


The theoretical results on Generative Stochastic Networks (GSNs) open for exploration a 
large class of possible parametrizations and training procedures which share the property 
that they can capture the underlying data distribution through the GSN Markov chain. 
What parametrizations will work well? Where and how should one inject noise to best 
balance fast mixing with making the implied conditional easy to model? We present results 
of preliminary experiments with specific selections for each of these choices, but the reader 
should keep in mind that the space of possibilities is vast. 


We start in Section 4.1 with results involving GSNs without latent variables (denoising 
auto-encoders in Section 3.1 and the walkback algorithm presented in Section [3.2[ ). Then in 
Section 4.2 we proceed with experiments related to GSNs with latent variables (model de¬ 
scribed in Section 3.4). Section 4.3 extends experiments of the walkback algorithm with the 
scaling factors discussed in Section 3.3, A Theancp] ( Bergstra et ah] 2010) implementation 
is availably including the links of datasets. 


4.1 Experimental results regarding walkback in DAEs 


We present here an experiment performed with a non-parametric estimator on two types of 
data and an experiment done with a parametric neural network on the MNIST dataset. 


Non-parametric case. The mathematical results presented here apply to any denois¬ 
ing training criterion where the reconstruction loss can be interpreted as a negative log- 
likelihood. This remains true whether or not the denoising machine P{X\X) is parametrized 
as the composition of an encoder and decoder. This is also true of the asymptotic estimation 
results in Alain and Bengio (2013). We experimentally validate the above theorems in a 


1. http://deeplearning.net/software/theano/ 

2. https://github.com/yaoli/GSN 
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case where the asymptotic limit (of enough data and enough capacity) can be reached, i.e., 
in a low-dimensional non-parametric setting. Fig. shows the distribution recovered by the 
Markov chain for discrete data with only 10 different values. The conditional P[X\X) was 
estimated by multinomial models and maximum likelihood (counting) from 5000 training 
examples. 5000 samples were generated from the chain to estimate the asymptotic distri¬ 
bution TTniX). For continuous data, Figure also shows the result of 5000 generated 
samples and 500 original training examples with X G with scatter plots of pairs of 
dimensions. The estimator is also non-parametric (Parzen density estimator of P{X\X)). 


samples data 

,-p-^,- 1 i,2|- 1 - ,—,— 



Figure 3: Top leit: histogram of a data-generating distribution (true, blue), the empirical 
distribution (red), and the estimated distribution using a denoising maximum likelihood 
estimator. Other figures: pairs of variables (out of 10) showing the training samples and 
the model-generated samples. 


MNIST digits. We trained a DAE on the binarized MNIST data (thresholding at 0.5). 
The 784-2000-784 auto-encoder is trained for 200 epochs with the 50000 training examples 
and salt-and-pepper noise (probability 0.5 of corrupting each bit, setting it to 1 or 0 with 
probability 0.5). It has 2000 tanh hidden units and is trained by minimizing cross-entropy 
loss, i.e., maximum likelihood on a factorized Bernoulli reconstruction distribution. With 
walkback training, a chain of 5 steps was used to generate 5 corrupted examples for each 
training example. Figure shows samples generated with and without walkback. The 
quality of the samples was also estimated quantitatively by measuring the log-likelihood of 
the test set under a non-parametric density estimator P{x) = m.eanj^P{x\X) constructed 
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from 10,000 consecutively generated samples ( X from the Markov c hain). The expected 
value of E[P(x)] over the samples can be shown (Bengio et ah, 2013d) to be a lower bound 
(i.e. conservative estimate) of the true (implicit) model density P(x). The test set log- 
likelihood bound was not used to select among model architectures, but visual inspection 
of samples generated did guide the preliminary search reported here. Optimization hyper¬ 
parameters (learning rate, momentum, and learning rate reduction schedule) were selected 
based on the training objective. We compare against a state-of-the-art RBM (Cho et ahj 


2013) with an AIS log-likelihood estimate of -64.1 (AIS estimates tend to be optimistic). 


We also drew samples from the RBM and applied the same estimator (using the mean of the 
RBM’s P(x\h) with h sampled from the Gibbs chain), and obtained a log-likelihood non- 
parametric bound of -233, skipping 100 MCMC steps between samples (otherwise numbers 
are very poor for the RBM, which mixes poorly). The DAE log-likelihood bound with and 
without walkback is respectively -116 and -142, conhrming visual inspection suggesting that 
the walkback algorithm produces less spurious samples. However, the RBM samples can 
be improved by a spatial blur. By tuning the amount of blur (the spread of the Gaussian 
convolution), we obtained a bound of -112 for the RBM. Blurring did not help the auto¬ 
encoder. 
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Figure 4: Successive samples generated by Markov chain associated with the trained DAEs 
according to the plain sampling scheme (left) and walkback sampling scheme (right). There 
are less “spurious” samples with the walkback algorithm. 


4.2 Experimental results for GSNs with latent variables 

We propose here to explore families of parametrizations which are similar to existing deep 


stochastic architectures such as the Deep Boltzmann Machine (DBM) (Salakhutdinov and 


Hinton 2009). Basically, the idea is to construct a computational graph that is similar to 


the computational graph for Gibbs sampling or variational inference in Deep Boltzmann 
Machines. However, we have to diverge a bit from these architectures in order to accom¬ 
modate the desirable property that it will be possible to back-propagate the gradient of 
reconstruction log-likelihood with respect to the parameters 9i and 02- Since the gradient 
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of a binary stochastic unit is 0 almost everywhere, we have to consider related alternatives. 
An interesting source of inspiration regarding this question is a recent paper on estimating 


or propagating gradients through stochastic neurons (Bengio 2013). Here we consider the 
following stochastic non-linearities: hi = r/out + tanh(77in + a*) where a* is the linear activa¬ 
tion for unit i (an affine transformation applied to the input of the unit, coming from the 
layer below, the layer above, or both) and r/in and ?7out are zero-mean Gaussian noises. 

To emulate a sampling procedure similar to Boltzmann machines in which the filled-in 
missing values can depend on the representations at the top level, the computational graph 
allows information to propagate both upwards (from input to higher levels) and downwards, 
giving rise to the computational graph structure illustrated in Figure which is similar to 
that explored for deterministic recurrent auto-encoders ( Seung[ [l998t [Behnke 2001; Savard 


2011). Downward weight matrices have been fixed to the transpose of corresponding upward 


weight matrices. 



Figure 5: Left: Generic GSN Markov chain with state variables Xt and Ht. Right: GSN 
Markov chain inspired by the unfolded computational graph of the Deep Boltzmann Machine 
Gibbs sampling process, but with backprop-able stochastic units at each layer. The training 
example X = xq starts the chain. Either odd or even layers are stochastically updated 
at each step. All xt's are corrupted by salt-and-pepper noise before entering the graph 
(lightning symbol). Each xt for t > 0 is obtained by sampling from the reconstruction 
distribution for that step, Pe^iXtlHt). The walkback training objective is the sum over 
all steps of log-likelihoods of target X = xq under the reconstruction distribution. In the 
special case of a unimodal Gaussian reconstruction distribution, maximizing the likelihood is 
equivalent to minimizing reconstruction error; in general one trains to maximum likelihood, 
not simply minimum reconstruction error. 


With the walkback algorithm, a different reconstruction distribution is obtained after each 
step of the short chain started at the training example X. It means that the computational 
graph from X to a reconstruction probability at step k actually involves generating inter¬ 
mediate samples as if we were running the Markov chain starting at X. In the experiments, 
the graph was unfolded so that 2D sampled reconstructions would be produced, where D 
is the depth (number of hidden layers). The training loss is the sum of the reconstruction 
negative log-likelihoods (of target X) over all 2D reconstructions. 


Experiments evaluating the ability of the GSN models to generate good samples were 
performed on the MNIST dataset and the Toronto Face Database (TFD), following the 
setup in Bengio et al.l (2013b). 
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Theorem requires Hq to have the same distribution as Hi (given Xq) during training, 
and this may be achieved by initializing each training chain with Hq set to the previous 
value of Hi when the same example Xq was shown. However, it turned out that even with 
a dumb initialization of Hq, good results were obtained in the experiments below. 


Networks with 2 and 3 hidden layers were evaluated and compared to regular denoising 
auto-encoders. The latter has just 1 hidden layer and no state to state transition, i.e., 
the computational graph can be split into separate graphs for each reconstruction step in 
the walkback algorithm. They all have tanh hidden units and pre- and post-activation 
Gaussian noise of standard deviation 2, applied to all hidden layers except the first. In 
addition, at each step in the chain, the input (or the resampled Xt) is corrupted with salt- 
and-pepper noise of 40% (i.e., 40% of the pixels are corrupted, and replaced with a 0 or 
a 1 with probability 0.5). Training is over 100 to 600 epochs at most, with good results 
obtained after around 100 epochs, using stochastic gradient descent (minibatch size of one 
example). Hidden layer sizes vary between 1000 and 1500 depending on the experiments, 
and a learning rate of 0.25 and momentum of 0.5 were selected to approximately minimize 
the reconstruction negative log-likelihood. The learning rate is reduced multiplicatively by 
0.99 after each epoch. Following Breuleux et al. (2011), the quality of the samples was 


also estimated quantitatively by measuring the log-likelihood of the test set under a Parzen 
density estimator constructed from 10,000 consecutively generated samples (using the real¬ 
valued mean-held reconstructions as the training data for the Parzen density estimator). 
This can be seen as a lower bound on the true log-likelihood, with the bound converging 
to the true likelihood as we consider more samples and appropriately set the smoothing 
parameter of the Parzen estimator 


Results are summarized in Table As in Section 4.1 the test set Parzen log-likelihood 
bound was not used to select among model architectures, but visual inspection of generated 
samples guided this preliminary search. Optimization hyper-parameters (learning rate, 
momentum, and learning rate reduction schedule) were selected based on the reconstruction 
log-likelihood training objective. The Parzen log-likelihood bound obtained with a two-layer 
model on MNIST is 214 (± standard error of 1.1), while the log-likelihood bound obtained 
by a single-layer model (regular denoising auto-encoder, DAE in the table) is substantially 
worse, at -152±2.2. 


In comparison, Bengio et al. (2013b) report a log-likelihood bound of -244±54 for RBMs 
and 138±2 for a 2-hidden layer DBN, using the same setup. We have also evaluated a 


3-hidden layer DBM (Salakhutdinov and Hinton, 2009), using the weights provided by the 
author, and obtained a Parzen log-likelihood bound of 32±2. See http://www.utstat. 
toronto.edu/~rsalakhu/DBM.html for details. 


Interestingly, the GSN and the DBN-2 actually perform slightly better than when using 
samples directly coming from the MNIST training set, perhaps because the mean-field 
outputs we use are more “prototypical” samples. 


3. However, in this paper, to be consistent with the numbers given in Bengio et al. (2013b I we used a 
Gaussian Parzen density, which makes the numbers not comparable with the AIS log-likelihood upper 
bounds for binarized images reported in other papers for the same data. 
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Figure shows two runs of consecutive samples from this trained model, illustrating 
that it mixes quite well (faster than RBMs) and produces rather sharp digit images. The 
figure shows that it can also stochastically complete missing values: the left half of the 
image was initialized to random pixels and the right side was clamped to an MNIST image. 
The Markov chain explores plausible variations of the completion according to the trained 
conditional distribution. 
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Figure 6: Top: two runs of consecutive samples (one row after the other) generated from 
2-layer GSN model, showing fast mixing between classes and nice sharp images. Note: only 
every fourth sample is shown. Bottom: conditional Markov chain, with the right half of the 
image clamped to one of the MNIST digit images and the left half successively resampled, 
illustrating the power of the generative model to stochastically fill-in missing inputs. 


4.3 Experimental results for GSNs with the scaling factors for walkbacks 


We present the experimental results regarding the discussion in Section 3.3 Experiments 
are done on both MNIST and TFD. For TFD, only the unsupervised part of the dataset 
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Figure 7: These are expanded plots of those in Figure Top: two runs of consecutive 
samples (one row after the other) generated from a 2-layer GSN model, showing that it 
mixes well between classes and produces nice sharp images. Figure contained only one 
in every four samples, whereas here every sample is shown. Bottom: conditional Markov 
chain, with the right half of the image clamped to one of the MNIST digit images and the 
left half successively resampled, illustrating the power of the trained generative model to 
stochastically fill-in missing inputs. Figure showed only 13 samples in each chain; here 
we show 26. 


is used, resulting 69,000 samples for train, 15,000 for validation, and 15,000 for test. The 
training examples are normalized to have a mean 0 and a standard deviation 1. 


For MNIST the GSNs we used have 2 hidden layers with 1000 tanh units each. Salt-and- 
pepper noise is used to corrupt inputs. We have performed extensive hyperparameter search 
on both the input noise level between 0.3 and 0.7, and the hidden noise level between 0.5 
and 2.0. The number of walkback steps is also randomly sampled between 2 and 6. All the 
experiments are done with learning the scaling factors, following the parameterization in 
Section 3.3. 1[ Following previous experiments, the log-probability of the test set is estimated 
by the same Parzen density estimator on consecutive 10,000 samples generated from the 
trained model. The a parameter in the Parzen estimator is cross-validated on the validation 
set. The sampling is performed with ai, the learned scaling factor for the first walkback 
step. The best model achieves a log-likelihood LL=237.44 on MNIST test set, which can 
be compared with the best reported result LL=225 from Goodfellow et al. (2014). 
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Figure 8: Left: consecutive GSN samples obtained after 10 training epochs. Right: GSN 
samples obtained after 25 training epochs. This shows quick convergence to a model that 
samples well. The samples in Figure are obtained after 600 training epochs. 



. A. A A A A A A A A A 



i.* - * • 




k* k.*- 

■I ^ ^ ^ ^ ^ 




L %r t %r ^ ^ * 1 - 


k.' 


i.—A* .iff ^ A —^ t ^ > I* 

A ^ aTa ws Vif mfsrn 


^ 49^ ac .'.jt o.*.. 

_j -i --i i • i. 1 ^ ^- • 


-J i -i i 




;r 




d 


Figure 9: Gonsecutive GSN samples from a model trained on the TFD dataset. At the end 
of each row, we show the nearest example from the training set to the last sample on that 
row to illustrate that the distribution is not merely copying the training set. 


On TFD, we follow a similar procedure as in MNIST, but with larger model capacity 
(GSNs with 2000-2000 tanh units) and a wider hyperparameter range on the input noise 
level (between 0.1 and 0.7), the hidden noise level (between 0.5 and 5.0), and the number 
of walkback steps (between 2 and 6). For comparison, two types of models are trained, one 
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Table 1: Test set log-likelihood lower bound (LL) obtained by a Parzen density estima¬ 
tor constructed using 10,000 generated samples, for different generative models trained on 
MNIST. The LL is not directly comparable to AIS likelihood estimates because we use a 
Gaussian mixture rather than a Bernoulli mixture to compute the likelihood, but we can 
compare with Rifai et al. (2012); Bengio et al. ( 2013b|c ) (from which we took the last three 
columns). A DBN-2 has 2 hidden layers, a CAE-1 has 1 hidden layer, and a CAE-2 has 
2. The DAE is basically a GSN-1, with no injection of noise inside the network. The last 
column uses 10,000 MNIST training examples to train the Parzen density estimator. 



GSN-2 

DAE 

RBM 

DBM-3 

DBN-2 

MNIST 

Log-likelihood lower bound 

214 

-152 

-244 

32 

138 

24 

Standard error 

1.1 

2.2 

54 

1.9 

2.0 

1.6 


with the scaling factor and one without. The evaluation metric is the same as the one used 
in MNIST experiments. We compute the Parzen density estimation on the hrst 10,000 test 
set examples. The best model without learning the scaling factor results in LL = 1044, 
and the best model with learning the scaling factor results in 1215 when the scaling factor 
from the hrst walkback step is used and 1189 when ah the scaling factors are used together 
with their corresponding walkback steps. As two further comparisons, using the mean over 
training examples to train the Parzen density estimator results in LL = 632, and using 
the validation set examples to train the Parzen estimator obtains LL = 2029 (this can be 
considered as an upper bound when the generated samples are almost perfect). Eigure [IT] 
shows the consecutive samples generated with the best model, compared with Figurej^that 
is trained without the scaling factor. In addition, Figure [T0| shows the learned scaling factor 
for both datasets that confirms the hypothesis on the effect of the scaling factors made in 
Section 13.31 


5. Conclusion 

We have introduced a new approach to training generative models, called Generative Stochas¬ 
tic Networks (GSN), which includes generative denoising auto-encoders as a special case 
(with no latent variable). It is an alternative to directly performing maximum likelihood 
on an explicit P{X), with the objective of avoiding the intractable marginalizations and 
partition function that such direct likelihood methods often entail. The training procedure 
is more similar to function approximation than to unsupervised learning because the recon¬ 
struction distribution is simpler than the data distribution, often unimodal (provably so in 
the limit of very small noise). This makes it possible to train unsupervised models that 
capture the data-generating distribution simply using backprop and gradient descent in a 
computational graph that includes noise injection. The proposed theoretical results state 
that under mild conditions (in particular that the noise injected in the networks prevents 
perfect reconstruction), training a sufficient-capacity model to denoise and reconstruct its 
observations (through a powerful family of reconstruction distributions) suffices to capture 
the data-generating distribution through a simple Markov chain. Another view is that we 
are training the transition operator of a Markov chain whose stationary distribution esti- 
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Figure 10: Learned Uk values for each walkback step k. Larger values of Uk correspond 
to greater uncertainty for TFD (real-valued) and less uncertainty for MNIST (binary), 
due to the differing methods of parameterization given in Section 3.3.1 and 3.3. ^ Thus, 
both learned factors reflect the fact that there is greater uncertainty after each consecutive 
walkback step. 


mates the data distribution, which has the potential of corresponding to an easier learning 
problem because the normalization constant for this conditional distribution is generally 
dominated by fewer modes. These theoretical results are extended to the case where the 
corruption is local but still allows the chain to mix and to the case where some inputs are 
missing or constrained (thus allowing to sample from a conditional distribution on a subset 
of the observed variables or to learned structured output models). The GSN framework is 
shown to lend to dependency networks a valid estimator of the joint distribution of the ob¬ 
served variables even when the learned conditionals are not consistent, also allowing to prove 
in a new way the consistency of generalized pseudolikelihood training, associated with the 
stationary distribution of a corresponding GSN (that randomly chooses a subset of variables 
and then resamples it). Experiments have been conducted to validate the theory, in the case 
where the GSN architecture is a simple denoising auto-encoder and in the case where the 
GSN emulates the Gibbs sampling process of a Deep Boltzmann Machine. A quantitative 
evaluation of the samples confirms that the training procedure works very well (in this case 
allowing us to train a deep generative model without layerwise pretraining) and can be used 
to perform conditional sampling of a subset of variables given the rest. After early versions 
of this work were published (Bengio et al., 2014), the GSN framework has been extended and 


applied to classification problems in several different ways (Goodfellow et ah, 2013; Zhou 


and Troyanskaya, 2014; Zohrer and Pernkopf, 2014) yielding very interesting results. In 


addition to providing a consistent generative interpretation to dependency networks, GSNs 


have been used to provide one to Multi-Prediction Deep Boltzmann Machines (Goodfellow 


et ah, 2013) and to provide a fast sampling algorithm for deep NADE (Yao et ah, 2014). 
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Figure 11: Consecutive GSN samples from a model trained on the TFD dataset. The scaling 
factors are learned. The samples are generated by using the scaling factor from the first 
walkback step. Samples are sharper compared with Figure (§. This is also reflected by an 
improvement of 140 in Parzen-estimated log-likelihood. 
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6. Appendix: Argument for consistency based on local noise 


This section presents one direction that we pursed initially to demonstrate that we had 
certain consistency properties in terms of recovering the correct stationary distribution 
when using a hnite training sample. We discuss this issue when we cite Theorem from 


the literature in section |3.4| and thought it would be a good idea to include our previous 
approach in this Appendix. 


The main theorem in Bengio et al. (2013c) (stated in supplemental as Theorem SI) 
requires that the Markov chain be ergodic. A set of conditions guaranteeing ergodicity is 
given in the aforementioned paper, but these conditions are restrictive in requiring that 
C(X\X) > 0 everywhere that P{X) > 0. The effect of these restrictions is that P 0 {X\X) 
must have the capacity to model every mode of P{X), exactly the difficulty we were trying 
to avoid. We show here how we may also achieve the required ergodicity through other 
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means, allowing us to choose a C(X\X) that only makes small jumps, which in turn only 
requires Pq{X\X) to model a small part of the space around each X. 


Let P 0 ^{X\X) be a denoising auto-encoder that has been trained on n training examples. 
Pg^(X\X) assigns a probability to X, given X, when X ~ C{X\X). This estimator dehnes 
a Markov chain obtained by sampling alternatively an X from C{X\X) and an X from 
Pq{X\X). Let TTn be the asymptotic distribution of the chain defined by T„, if it exists. 


The following theorem is proven by Bengio et al. (2013c) 


Theorem SI If P 0 ^(X\X) is a consistent estimator of the true conditional distribution 
P{X\X) and Tn defines an ergodic Markov chain, then as n ^ oo, the asymptotic distri¬ 
bution TTn{X) of the generated samples converges to the data-generating distribution P{X). 


In order for Theoremto apply, the chain must be ergodic. One set of conditions under 
which this occurs is given in the aforementioned paper. We slightly restate them here: 






X (arbitrary units) 


Figure 

2013c[), 


12: U C{X\X) is globally supported as required by Corollary 10 (Bengio et al 


then for P 0 ^(X\X) to converge to P{X\X), it will eventually have to model all of 


the modes in P{X), even though the modes are damped (see “leaky modes” on the left). 
However, if we guarantee ergodicity through other means, as in Corollary |11[ we can choose 
a local C{X\X) and allow P 0 ^(X\X) to model only the local structure of P{X) (see right). 
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Corollary 10 If the support for both the data-generating distribution and denoising model 
are contained in and non-zero in a finite-volume region V (i.e., MX, MX ^ V, P{X) = 
0,Pe{X\X) = 0 and'iX, VX e V, P(X) > O,P 0 (XlX) > 0,C(X|X) > Oj and these 
statements remain true in the limit of n ^ oo, then the chain defined by will be ergodie. 


If conditions in Corollary 10 apply, then the chain will be ergodie and Theorem SI will 
apply. However, these conditions are sufficient, not necessary, and in many cases they may 
be artificially restrictive. In particular. Corollary |10| defines a large region V containing any 
possible X allowed by the model and requires that we maintain the probability of jumping 
between any two points in a single move to be greater than 0. While this generous condition 
helps us easily guarantee the ergodicity of the chain, it also has the unfortunate side effect of 
requiring that, in order for P 0 ^{X\X) to converge to the conditional distribution P{X\X), 
it must have the capacity to model every mode of P{X), exactly the difficulty we were 
trying to avoid. The left two plots in Figure 12 show this difficulty: because C{X\X) > 0 


everywhere in V, every mode of P{X) will leak, perhaps attenuated, into P{X\X). 


Fortunately, we may seek ergodicity through other means. The following corollary allows 
us to choose a C{X\X) that only makes small jumps, which in turn only requires Pe{X\X) 
to model a small part of the space V around each X. 

Let P0^{X\X) be a denoising auto-encoder that has been trained on n training examples 
and C(X\X) be some corruption distribution. Pg^{X\X) assigns a probability to X, given 
X, when X ~ C(X\X) and X ~ V{X). Define a Markov chain by alternately sampling 
an X from C(X\X) and an X from Pg{X\X). 


Corollary 11 If the data-generating distribution is contained in and non-zero in a finite- 
volume region V (i.e., MX ^ V, P(X) = 0, andMX G V, P{X) > 0) and all pairs of points 
in V can be connected by a finite-length path through V and for some e > 0, MX G V, MX G V 
within e of each other, C{X\X) > 0 and Pg{X\X) > 0 and these statements remain true in 
the limit of n ^ oo, then the chain defined by Tn will be ergodie. 


Proof Consider any two points Xa and Xg in V. By the assumptions of Corollary|ll[ there 
exists a finite length path between Xa and Xg through V. Pick one such finite length path 
P. Chose a finite series of points x = {xi, X 2 ,..., x^} along P, with xi = Xa and Xk = Xg 
such that the distance between every pair of consecutive points (xj,Xi+i) is less than e as 
defined in Corollary 11, Then the probability of sampling X = Xj+i from C{X\xi)) will be 


positive, because C{X\X)) > 0 for all X within e of X by the assumptions of Corollary 11 
Further, the probability of sampling X = X = Xj+i from Pg{X\X) will be positive from the 
same assumption on P. Thus the probability of jumping along the path from x* to Xj+i, 
T„(Xt_|_i = Xj+i|Xt = Xi), will be greater than zero for all jumps on the path. Because 
there is a positive probability finite length path between all pairs of points in V, all states 
commute, and the chain is irreducible. If we consider Xa = Xg G V, by the same arguments 
Tn{Xt = Xa\Xt-i = Xa) > 0. Because there is a positive probability of remaining in the 
same state, the chain will be aperiodic. Because the chain is irreducible and over a finite 
state space, it will be positive recurrent as well. Thus, the chain defined by Tn is ergodie. 
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Although this is a weaker condition that has the advantage of making the denoising 
distribution even easier to model (probably having less modes), we must be careful to 
choose the ball size e large enough to guarantee that one can jump often enough between 
the major modes of P{X) when these are separated by zones of tiny probability, e must 
be larger than half the largest distance one would have to travel across a desert of low 
probability separating two nearby modes (which if not connected in this way would make V 
not anymore have a single connected component). Practically, there is a trade-off between 
the difficulty of estimating P{X\X) and the ease of mixing between major modes separated 
by a very low density zone. 
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